SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations #21597

Merged: arthw merged 12 commits into ggml-org:master from PMZFX:sycl-fix-multigpu-ram on May 14, 2026
@PMZFX
Contributor

@PMZFX PMZFX commented Apr 8, 2026

Summary

  • Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend
  • Replace sycl::free with zeMemFree for corresponding deallocations
  • Replace host-staged dev2dev_memcpy with direct Level Zero cross-device copy
  • Link against ze_loader for Level Zero API access
  • All changes include automatic fallback to original SYCL path if Level Zero is unavailable

Problem

On Intel multi-GPU systems, sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM export path (xe_gem_prime_export -> ttm_pool_alloc_page), which creates a 1:1 mirror of every VRAM allocation in system RAM. This causes system RAM to scale linearly with total VRAM allocated across GPUs, leading to OOM crashes during multi-GPU inference even when models fit entirely in VRAM.

Measured on dual Intel Arc Pro B70 (32GB each, 64GB total VRAM) with 64GB system RAM:

  • sycl::malloc_device 4 GiB = +4,112 MiB system RAM (1:1 mirror)
  • zeMemAllocDevice 4 GiB = +8 MiB system RAM (no mirror)

A 15.6 GiB Q4_K_M model consumed 60 GiB of system RAM during dual-GPU inference with sycl::malloc_device, causing repeated OOM crashes.

Solution

zeMemAllocDevice allocates GPU memory through Level Zero's SVM/P2P path instead of the DMA-buf/TTM path, avoiding the host memory staging entirely. SYCL kernels can read zeMemAllocDevice pointers with full interop, no compatibility issues.

Changes:

  • New static ggml_sycl_malloc_device() in ggml-sycl.cpp and ggml_sycl_free_device() in common.cpp that try Level Zero first, fall back to SYCL
  • Replaced 4 allocation sites: single-device buffer, split buffer, memory pool, overflow pool
  • Replaced 5 deallocation sites: buffer destructor, pool destructor, pool overflow, release_extra_gpu, and the pre-existing sycl::free in common.cpp
  • Updated dev2dev_memcpy to use zeCommandListAppendMemoryCopy for direct cross-device transfers

Test results

Dual Intel Arc Pro B70 (32GB each), AMD Ryzen 5 9600X, 64GB DDR5, Ubuntu 26.04, kernel 7.0, compute-runtime 26.09. Model: Qwen3.5-27B.

Q4_K_M, 48K context, dual GPU (-sm layer):

| Metric | Before | After |
|---|---|---|
| Peak system RAM | 60,034 MiB (100%), OOM crash | ~6.7 GiB (10%), flat |
| pp48000 | OOM crash | 782 t/s |
| pp512 | 348 t/s | 359 t/s |
| tg128 | 17.92 t/s | 17.82 t/s |

Q8_0, 32K context, dual GPU: 915 t/s, system RAM flat.

Single GPU: No regression. 467 t/s pp512, 17.12 t/s tg128.

Correctness: Output is byte-for-byte identical between single and dual GPU with same seed (verified Q4_K_M, Q6_K).

Test plan

  • Single GPU inference (no regression)
  • Dual GPU pp512/tg128 (Q4_K_M, Q6_K, Q8_0)
  • Dual GPU large context (48K Q4_K_M, 48K Q6_K, 32K Q8_0)
  • System RAM stays flat during all dual-GPU tests
  • Correctness: single vs dual GPU output matches with fixed seed
  • Clean exit (no crash during cleanup/teardown) — pre-existing UR_RESULT_ERROR_INVALID_MEM_OBJECT in ggml_sycl_pool_host::~ggml_sycl_pool_host during teardown; reproduced identically on the commit before all changes, not a regression
  • Fallback path: builds and works without Level Zero

@PMZFX PMZFX requested a review from a team as a code owner April 8, 2026 01:06
@github-actions github-actions Bot added the ggml and SYCL labels Apr 8, 2026
Contributor

@arthw arthw left a comment

  1. Don't use try/catch in the malloc/free memory functions.
    It adds overhead.
    Just check the return value and call the fallback function.
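
The return-value approach suggested here could look roughly like the sketch below. It is not the PR's code: `AllocFn`, `alloc_with_fallback`, and the two stand-in allocators are hypothetical names, and plain `malloc` stands in for the real device allocators so the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator signature: returns nullptr on failure instead of throwing.
using AllocFn = void * (*)(std::size_t);

// Stand-in for a primary allocator that is unavailable on this system.
inline void * primary_alloc_unavailable(std::size_t) { return nullptr; }

// Stand-in for the fallback allocator; plain malloc is used for illustration only.
inline void * fallback_alloc(std::size_t size) { return std::malloc(size); }

// Try the primary path, check its return value, and use the fallback on failure.
// No exception is thrown or caught on this path.
inline void * alloc_with_fallback(AllocFn primary, AllocFn fallback, std::size_t size) {
    assert(fallback != nullptr && "a fallback allocator is required");
    void * ptr = (primary != nullptr) ? primary(size) : nullptr;
    if (ptr != nullptr) {
        return ptr;
    }
    return fallback(size);
}
```

A caller would pass the preferred allocator first and the backup second, then check the single returned pointer for `nullptr`.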

Contributor

@arthw arthw left a comment

I will test it on Windows
and report back with the result.

Thank you!

Comment thread ggml/src/ggml-sycl/common.cpp Outdated
SYCL_CHECK(
CHECK_TRY_ERROR(sycl::free(extra->data_device[i], *(streams[i]))));
bool freed = false;
try {
Contributor

Use a new function to replace the duplicated memory-freeing code.
Handle the result with SYCL_CHECK(CHECK_TRY_ERROR()), which prints out the stack info.

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated

static void ggml_sycl_free_device(void *ptr, sycl::queue &q) {
if (!ptr) return;
try {
Contributor

remove try ... catch

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated

static void dev2dev_memcpy(sycl::queue &q_dst, sycl::queue &q_src, void *ptr_dst,
const void *ptr_src, size_t size) {
try {
Contributor

The legacy code supports memcpy between an iGPU and a dGPU.
The system API only supports copies between dGPUs.
So check the device types before calling the ze API,
and use the new code only for dGPU-to-dGPU copies.

Remove the try/catch, which is expensive.
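
The device-type check described in this comment can be illustrated with a small decision function. This is only a sketch: `DeviceKind`, `CopyPath`, and `choose_copy_path` are hypothetical names, and the actual copy calls are deliberately left out.

```cpp
#include <cassert>

// Hypothetical device classification, determined once per device.
enum class DeviceKind { IntegratedGpu, DiscreteGpu };

// The two copy strategies discussed in the review.
enum class CopyPath { DirectLevelZero, HostStaged };

// Pick the copy path from the two device kinds alone, so no exception
// handling is needed to discover an unsupported combination.
inline CopyPath choose_copy_path(DeviceKind src, DeviceKind dst) {
    const bool both_discrete =
        src == DeviceKind::DiscreteGpu && dst == DeviceKind::DiscreteGpu;
    return both_discrete ? CopyPath::DirectLevelZero : CopyPath::HostStaged;
}

// Small self-check showing the intended behaviour.
inline void choose_copy_path_self_check() {
    assert(choose_copy_path(DeviceKind::DiscreteGpu, DeviceKind::DiscreteGpu) ==
           CopyPath::DirectLevelZero);
    assert(choose_copy_path(DeviceKind::IntegratedGpu, DeviceKind::DiscreteGpu) ==
           CopyPath::HostStaged);
}
```

Any pairing that involves an iGPU keeps the legacy host-staged path, which matches the constraint stated above.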

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
void * dev_ptr;
SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)sycl::malloc_device(
size, *stream)));
void * dev_ptr = ggml_sycl_malloc_device(size, *stream);
Contributor

add SYCL_CHECK(CHECK_TRY_ERROR())

Contributor

This still needs SYCL_CHECK(CHECK_TRY_ERROR()) to print out the call stack on a crash.
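
To illustrate why the reviewer wants allocation calls wrapped, here is a simplified sketch of a CHECK_TRY_ERROR-style macro. `DEMO_CHECK_TRY_ERROR` and `demo_malloc_device` are hypothetical and are not the backend's actual definitions; the sketch only reports the file and line of the failing call rather than a full stack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a CHECK_TRY_ERROR-style macro: run the expression,
// turn any std::exception into a nonzero error code, and log where it happened.
#define DEMO_CHECK_TRY_ERROR(expr)                                      \
    [&]() -> int {                                                      \
        try {                                                           \
            expr;                                                       \
            return 0;                                                   \
        } catch (const std::exception & e) {                            \
            std::fprintf(stderr, "%s\nException caught at %s:%d\n",     \
                         e.what(), __FILE__, __LINE__);                 \
            return 1;                                                   \
        }                                                               \
    }()

// Hypothetical allocator that throws when asked for zero bytes and otherwise
// hands out a small static arena (enough for a demonstration).
inline void * demo_malloc_device(std::size_t size) {
    if (size == 0) {
        throw std::invalid_argument("demo_malloc_device: size must be nonzero");
    }
    static char arena[256];
    assert(size <= sizeof(arena));
    return arena;
}
```

With such a wrapper, `int err = DEMO_CHECK_TRY_ERROR(ptr = demo_malloc_device(n));` yields an error code and a located message instead of an uncaught exception.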

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
*/
SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)sycl::malloc_device(
size, *stream)));
char * buf = (char *)ggml_sycl_malloc_device(size, *stream);
Contributor

add SYCL_CHECK(CHECK_TRY_ERROR())

Contributor

This still needs SYCL_CHECK(CHECK_TRY_ERROR()).

@arthw
Contributor

arthw commented Apr 8, 2026

@PMZFX
This PR does not support the Windows build.
Please use the following patch to support it.
With the patch the Windows build works, but I can't test the performance.
Maybe someone can test the performance on Windows.

diff --git a/ggml/src/ggml-sycl/CMakeLists.txt b/ggml/src/ggml-sycl/CMakeLists.txt
index f87835b3c..90a416505 100644
--- a/ggml/src/ggml-sycl/CMakeLists.txt
+++ b/ggml/src/ggml-sycl/CMakeLists.txt
@@ -39,6 +39,19 @@ if (WIN32)
         set(CMAKE_CXX_COMPILER "icx")
         set(CMAKE_CXX_COMPILER_ID "IntelLLVM")
     endif()
+    if(DEFINED ENV{LEVEL_ZERO_V1_SDK_PATH})
+        message(STATUS "LEVEL_ZERO_V1_SDK_PATH is set to: $ENV{LEVEL_ZERO_V1_SDK_PATH}")
+        set(LEVEL_ZERO_V1_SDK_PATH $ENV{LEVEL_ZERO_V1_SDK_PATH})
+        if(EXISTS "${LEVEL_ZERO_V1_SDK_PATH}")
+            target_include_directories(ggml-sycl PRIVATE "${LEVEL_ZERO_V1_SDK_PATH}/include")
+            set(LEVEL_ZERO_V1_SDK_LIB_PATH $ENV{LEVEL_ZERO_V1_SDK_PATH}/lib)
+        else()
+            message(FATAL_ERROR "Miss to detect folder ${LEVEL_ZERO_V1_SDK_PATH}, please install the Intel GPU Driver.")
+        endif()
+     else()
+        message(WARNING "LEVEL_ZERO_V1_SDK_PATH is NOT set")
+        message(FATAL_ERROR "Miss to detect ENV LEVEL_ZERO_V1_SDK_PATH, please install the Intel GPU Driver.")
+     endif()
 endif()
 
 macro(detect_and_find_package package_name)
@@ -96,7 +109,7 @@ target_compile_options(ggml-sycl PRIVATE "-Wno-narrowing")
 # Link against Level Zero loader for direct device memory allocation.
 # Avoids sycl::malloc_device triggering DMA-buf/TTM system RAM staging
 # in the xe kernel driver during multi-GPU inference.
-find_library(ZE_LOADER_LIB ze_loader HINTS ${ONEAPI_ROOT}/lib ENV LD_LIBRARY_PATH)
+find_library(ZE_LOADER_LIB ze_loader HINTS ${ONEAPI_ROOT}/lib ${LEVEL_ZERO_V1_SDK_LIB_PATH} ENV LD_LIBRARY_PATH)
 if(ZE_LOADER_LIB)
     target_link_libraries(ggml-sycl PRIVATE ${ZE_LOADER_LIB})
     message(STATUS "Level Zero loader found: ${ZE_LOADER_LIB}")

@PMZFX
Contributor Author

PMZFX commented Apr 8, 2026

@arthw Thanks for the thorough review. I've pushed a follow-up commit addressing your feedback:

  • Removed all try/catch, replaced with upfront backend/device type checks (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
  • Moved shared helpers to common.cpp/common.hpp to eliminate duplication
  • Added SYCL_CHECK(CHECK_TRY_ERROR()) for fallback free calls
  • Guarded dev2dev_memcpy L0 path to dGPU-to-dGPU only
  • Incorporated your Windows Level Zero SDK path patch in CMakeLists.txt

Let me know if anything else needs attention.

Contributor

@arthw arthw left a comment

Because this PR is the first to introduce the Level Zero API, there are more issues to consider.

  1. Building against the Level Zero API requires the GPU driver (Level Zero runtime) to be installed on the build server. In some CI setups the build server is a CPU-only (Xeon) machine, which would break builds that use the Level Zero API.
  2. Some SYCL memory features are on the way, such as SYCL graph and SVM. These features still need the SYCL memory API.
  3. The SYCL memory API is built on the Level Zero memory API. Bypassing SYCL to call Level Zero directly loses some of the benefits of SYCL code.

Suggestion:

  1. Define a build parameter GGML_SYCL_SUPPORT_LEVEL_ZERO in ggml/CMakeLists.txt.
    Refer to GGML_SYCL_GRAPH.
    The default value is "ON".

  2. In the code, use this macro (GGML_SYCL_SUPPORT_LEVEL_ZERO) to guard all Level Zero code and includes, so that when it is off the code can be built without the Level Zero library and headers installed.

  3. Define an environment variable GGML_SYCL_ENABLE_LEVEL_ZERO in ggml-sycl.cpp, like GGML_SYCL_DISABLE_GRAPH. It controls the behavior at runtime.

  4. The SYCL backend memory APIs comprise two paths: SYCL and Level Zero.
    If GGML_SYCL_SUPPORT_LEVEL_ZERO = ON, both branches (SYCL and Level Zero) are compiled in, and GGML_SYCL_ENABLE_LEVEL_ZERO selects the branch at runtime.
    If GGML_SYCL_SUPPORT_LEVEL_ZERO = OFF, only the SYCL branch exists at the code level.
    This way SYCL and Level Zero memory APIs are never mixed within a session: only one style of API is used. If an allocation fails, the code does not switch to the other API.

  5. SYCL.md should be updated to document the new parameters and the dependency on the Intel GPU driver installation when building with Level Zero API support.
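
The two-level control suggested above (a build-time macro plus a runtime environment variable) might be sketched as follows. Apart from the environment variable name GGML_SYCL_ENABLE_LEVEL_ZERO, every identifier here is hypothetical, `DEMO_SUPPORT_LEVEL_ZERO` merely stands in for the proposed build option, and the runtime default of 1 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdlib>

// Compile-time switch. In a real build this would come from the build system;
// it is defaulted here only so that the sketch compiles on its own.
#ifndef DEMO_SUPPORT_LEVEL_ZERO
#define DEMO_SUPPORT_LEVEL_ZERO 1
#endif

enum class MemoryApi { Sycl, LevelZero };

// Parse an integer flag from an environment variable value.
// A null or empty value means "unset", so the default applies.
inline int parse_env_flag(const char * value, int default_value) {
    if (value == nullptr || *value == '\0') {
        return default_value;
    }
    return std::atoi(value);
}

// Decide once, at startup, which memory API the whole session uses.
inline MemoryApi select_memory_api(const char * env_value) {
#if DEMO_SUPPORT_LEVEL_ZERO
    const int enabled = parse_env_flag(env_value, /*default_value=*/1);
    return enabled != 0 ? MemoryApi::LevelZero : MemoryApi::Sycl;
#else
    (void) env_value;  // Level Zero is compiled out: only the SYCL branch exists.
    return MemoryApi::Sycl;
#endif
}

// Convenience wrapper that reads the variable from the process environment.
inline MemoryApi select_memory_api_from_env() {
    return select_memory_api(std::getenv("GGML_SYCL_ENABLE_LEVEL_ZERO"));
}
```

Making the choice once and storing the result keeps a session on a single API style, which is the property item 4 asks for.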

What do you think?

Thank you!

Comment thread ggml/src/ggml-sycl/dpct/helper.hpp Outdated

static inline void *dpct_malloc(size_t size, sycl::queue &q)
{
try {
Contributor

  1. Remove the try/catch.
  2. This code duplicates code in ggml-sycl.cpp. I suggest defining a new function for ze memory.

Comment thread ggml/src/ggml-sycl/common.cpp Outdated
return sycl_down_blk_size;
}

bool ggml_sycl_is_level_zero(sycl::queue &q) {
Contributor

The SYCL backend is designed to run on Level Zero only.
There is no need to check for the Level Zero runtime here.

Comment thread ggml/src/ggml-sycl/common.cpp Outdated
return q.get_backend() == sycl::backend::ext_oneapi_level_zero;
}

bool ggml_sycl_is_dgpu(sycl::queue &q) {
Contributor

I suggest saving the hardware info during initialization.
Refer to:
ggml-sycl.cpp:

   info.devices[i].smpbo = prop.get_local_mem_size();

common.hpp:

  struct sycl_device_info {
    size_t  smpbo;
     ...
  }
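
Following the pattern the reviewer points to, the device type could be cached once at initialization instead of being queried on every call. The sketch below is illustrative only: `demo_device_info`, its `is_dgpu` field, and the probe callback are hypothetical and do not reflect the backend's real structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-device record, filled once during backend initialization.
struct demo_device_info {
    std::size_t local_mem_size;  // plays the role of an smpbo-style cached property
    bool        is_dgpu;         // hypothetical extra field: cached device type
};

struct demo_sycl_info {
    std::vector<demo_device_info> devices;
};

// Populate the table once. `probe` is any callable taking a device index and
// returning a demo_device_info; a real backend would query the runtime here.
template <typename Probe>
demo_sycl_info demo_init_device_info(int device_count, Probe probe) {
    demo_sycl_info info;
    for (int i = 0; i < device_count; ++i) {
        info.devices.push_back(probe(i));
    }
    return info;
}

// Hot-path lookup: a plain table read with no runtime query.
inline bool demo_is_dgpu(const demo_sycl_info & info, int device) {
    assert(device >= 0 && static_cast<std::size_t>(device) < info.devices.size());
    return info.devices[static_cast<std::size_t>(device)].is_dgpu;
}
```

Allocation and copy helpers can then branch on the cached flag without touching the device API.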

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
// via xe_gem_prime_export, consuming system RAM equal to VRAM allocated.
// zeMemAllocDevice uses the SVM/P2P path with no host staging.
static void * ggml_sycl_malloc_device(size_t size, sycl::queue &q) {
if (ggml_sycl_is_level_zero(q) && ggml_sycl_is_dgpu(q)) {
Contributor

Move the ze-based malloc/free into new functions.

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
const void *ptr_src, size_t size) {
// Use Level Zero direct copy for dGPU-to-dGPU transfers.
// The legacy host-staged path supports iGPU-to-dGPU copies.
if (ggml_sycl_is_level_zero(q_dst) && ggml_sycl_is_dgpu(q_dst) && ggml_sycl_is_dgpu(q_src)) {
Contributor

There is no need to check for Level Zero here.

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
void * dev_ptr;
SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)sycl::malloc_device(
size, *stream)));
void * dev_ptr = ggml_sycl_malloc_device(size, *stream);
Contributor

This still needs SYCL_CHECK(CHECK_TRY_ERROR()) to print out the call stack on a crash.

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
*/
SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)sycl::malloc_device(
size, *stream)));
char * buf = (char *)ggml_sycl_malloc_device(size, *stream);
Contributor

This still needs SYCL_CHECK(CHECK_TRY_ERROR()).

@PMZFX
Contributor Author

PMZFX commented Apr 8, 2026

@arthw Thanks for the additional suggestions on the build/runtime flag architecture. Pushed a new commit implementing your approach:

  • Added GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON), all L0 code/includes wrapped in #ifdef. Checks for both the loader library and headers before enabling, so it degrades cleanly on systems without the L0 SDK.
  • Added GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1) to control which API is used. No mixing of L0 and SYCL memory APIs within a session.
  • Added a startup check that verifies devices actually use the Level Zero backend before enabling L0 APIs. Auto-disables with a warning if they don't.
  • Removed the L0 code from dpct_malloc (it was dead code and still had the try/catch issue).
  • Updated SYCL.md with both new parameters.

Tested with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds, and with the runtime flag toggled both ways. Let me know what you think.

@HumerousGorgon

Will this need a docs update with the new build variable?

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
Comment on lines +738 to +742
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
dev_ptr = ggml_sycl_malloc_device(size, *stream);
#else
SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)sycl::malloc_device(size, *stream)));
#endif
Contributor

Suggested change
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
-dev_ptr = ggml_sycl_malloc_device(size, *stream);
-#else
-SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)sycl::malloc_device(size, *stream)));
-#endif
+SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)ggml_sycl_malloc_device(size, *stream)));

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
Comment on lines +986 to +990
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
buf = (char *)ggml_sycl_malloc_device(size, *stream);
#else
SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)sycl::malloc_device(size, *stream)));
#endif
Contributor

Suggested change
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
-buf = (char *)ggml_sycl_malloc_device(size, *stream);
-#else
-SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)sycl::malloc_device(size, *stream)));
-#endif
+SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)ggml_sycl_malloc_device(size, *stream)));

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
Comment on lines +1401 to +1405
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
ptr = ggml_sycl_malloc_device(look_ahead_size, *qptr);
#else
SYCL_CHECK(CHECK_TRY_ERROR(ptr = (void *)sycl::malloc_device(look_ahead_size, *qptr)));
#endif
Contributor

Suggested change
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
-ptr = ggml_sycl_malloc_device(look_ahead_size, *qptr);
-#else
-SYCL_CHECK(CHECK_TRY_ERROR(ptr = (void *)sycl::malloc_device(look_ahead_size, *qptr)));
-#endif
+SYCL_CHECK(CHECK_TRY_ERROR(ptr = (void *)ggml_sycl_malloc_device(look_ahead_size, *qptr)));

Comment thread ggml/src/ggml-sycl/ggml-sycl.cpp Outdated
Comment on lines +1433 to +1437
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
ggml_sycl_free_device(ptr, *qptr);
#else
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, *qptr)));
#endif
Contributor

Suggested change
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
-ggml_sycl_free_device(ptr, *qptr);
-#else
-SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, *qptr)));
-#endif
+SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(ptr, *qptr)));

Comment thread ggml/CMakeLists.txt Outdated
option(GGML_SYCL "ggml: use SYCL" OFF)
option(GGML_SYCL_F16 "ggml: use 16 bit floats for sycl calculations" OFF)
option(GGML_SYCL_GRAPH "ggml: enable graphs in the SYCL backend" ON)
option(GGML_SYCL_SUPPORT_LEVEL_ZERO "ggml: use Level Zero for device memory in SYCL" ON)
Contributor

Suggested change
-option(GGML_SYCL_SUPPORT_LEVEL_ZERO "ggml: use Level Zero for device memory in SYCL" ON)
+option(GGML_SYCL_SUPPORT_LEVEL_ZERO "ggml: use Level Zero API in SYCL backend" ON)

Comment thread ggml/src/ggml-sycl/common.hpp Outdated
Comment on lines +305 to +308
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
extern int g_ggml_sycl_enable_level_zero;
void ggml_sycl_free_device(void *ptr, sycl::queue &q);
#endif
Contributor

Suggested change
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
-extern int g_ggml_sycl_enable_level_zero;
-void ggml_sycl_free_device(void *ptr, sycl::queue &q);
-#endif
+extern int g_ggml_sycl_enable_level_zero;
+void ggml_sycl_free_device(void *ptr, sycl::queue &q);

Comment thread ggml/src/ggml-sycl/common.cpp Outdated
Comment on lines +94 to +98
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
ggml_sycl_free_device(extra->data_device[i], *(streams[i]));
#else
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(extra->data_device[i], *(streams[i]))));
#endif
Contributor

Suggested change
-#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
-ggml_sycl_free_device(extra->data_device[i], *(streams[i]));
-#else
-SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(extra->data_device[i], *(streams[i]))));
-#endif
+SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(extra->data_device[i], *(streams[i]))));

Comment thread ggml/src/ggml-sycl/CMakeLists.txt Outdated
Comment on lines +108 to +109
if (GGML_SYCL_SUPPORT_LEVEL_ZERO)
message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO enabled")
Contributor

Suggested change
-if (GGML_SYCL_SUPPORT_LEVEL_ZERO)
-message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO enabled")
+message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO ${GGML_SYCL_SUPPORT_LEVEL_ZERO}")
+if (GGML_SYCL_SUPPORT_LEVEL_ZERO)

@github-actions github-actions Bot added the documentation label Apr 9, 2026
@NeoZhangJianyu
Contributor

> Will this need a docs update with the new build variable?

Yes, SYCL.md has been updated to add the description.

Contributor

@arthw arthw left a comment

Because the Level Zero library is a mandatory part of the build by default,
the SYCL CI (compile) jobs need to install it on both Windows and Ubuntu.
Please update .github/workflows/build.yml accordingly.
Refer to the Intel GPU driver installation steps for Windows/Ubuntu.

Here is example code for reference:
Ubuntu:

    wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor --output /usr/share/keyrings/oneapi-archive-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
    sudo apt-get update
    sudo apt-get install -y level-zero level-zero-devel intel-level-zero-gpu

Windows:

    $release = Invoke-RestMethod -Uri "https://api.github.com/repos/oneapi-src/level-zero/releases/latest"
    $asset = $release.assets | Where-Object { $_.name -like "level-zero-win-sdk*.zip" } | Select-Object -First 1

    Invoke-WebRequest -Uri $asset.browser_download_url -OutFile "level-zero-win-sdk.zip"

    Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:\level-zero-sdk" -Force

    # Set environment variables for the build (MSVC / CMake)
    echo "LEVEL_ZERO_INCLUDE_DIR=C:\level-zero-sdk\include" | Out-File -FilePath $env:GITHUB_ENV -Append
    echo "LEVEL_ZERO_LIBRARY_DIR=C:\level-zero-sdk\lib" | Out-File -FilePath $env:GITHUB_ENV -Append
    echo "C:\level-zero-sdk\lib" | Out-File -FilePath $env:GITHUB_PATH -Append   # if needed for runtime DLL

@PMZFX
Contributor Author

PMZFX commented Apr 11, 2026

Thanks for the guidance and the Windows CI examples, I'll get that updated!

@PMZFX
Contributor Author

PMZFX commented Apr 13, 2026

Pushed the CI update for Level Zero SDK installation (Ubuntu and Windows), and two additional fixes found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set):

  1. The Level Zero backend check now skips non-GPU devices. Without this, the OpenCL CPU device was causing Level Zero to be disabled for the GPUs, which defeats the fix on systems that don't set ONEAPI_DEVICE_SELECTOR.

  2. Routed sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) through the Level Zero allocation path for consistency with the other device memory call sites.
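
The first fix above, skipping non-GPU devices in the backend check, can be sketched with mock device records. `DemoDevice`, `Backend`, and `all_gpus_use_level_zero` are hypothetical names; returning false when no GPU is present is an assumption of this sketch, not something stated in the PR.

```cpp
#include <cassert>
#include <vector>

enum class Backend { LevelZero, OpenCL };

// Hypothetical, simplified view of a device as seen at startup.
struct DemoDevice {
    bool    is_gpu;
    Backend backend;
};

// Level Zero stays enabled only if at least one GPU exists and every GPU uses
// the Level Zero backend. Non-GPU devices (for example an OpenCL CPU device)
// are skipped so that they cannot veto the decision.
inline bool all_gpus_use_level_zero(const std::vector<DemoDevice> & devices) {
    bool saw_gpu = false;
    for (const DemoDevice & dev : devices) {
        if (!dev.is_gpu) {
            continue;
        }
        saw_gpu = true;
        if (dev.backend != Backend::LevelZero) {
            return false;
        }
    }
    return saw_gpu;
}
```

With a CPU device on OpenCL and two GPUs on Level Zero, the check passes, which is the situation that previously disabled the Level Zero path.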

Tested all configurations on dual B70: L0 on (single and dual GPU), L0 off via env var, and GGML_SYCL_SUPPORT_LEVEL_ZERO=OFF build. All clean.

@github-actions github-actions Bot added the devops label Apr 13, 2026
@NeoZhangJianyu
Contributor

@PMZFX
This issue was also reported to PyTorch.
After investigation, the root cause turned out to be a runtime issue: pytorch/pytorch#180145

> Dug into this more. I think the driver upgrade was the actual fix, not the allocation path change. The process I measured had been running on compute-runtime 25.18, and the patched binary loaded 26.09. Two things changed at once.
>
> Tested both paths on 26.09 with 264 allocations (4.7GB): zero host RAM shadow either way.

I drafted some SYCL code to test this; there is no extra host memory usage with this runtime:

dpkg -l | grep libze-intel-gpu1
ii  libze-intel-gpu1                              26.05.37020.3-1~24.04~ppa3  

Could you check it?

Supporting the Level Zero API adds a lot of extra work.
Even though all of it is implemented, we still need to test it.

Compared to the Level Zero memory API, there are new features that depend on the SYCL memory API: SYCL graph and SVM, as far as I know.

My suggestion is to put this PR on hold, or to set the SYCL memory API as the default.
What do you think?

Thank you!

@PMZFX
Contributor Author

PMZFX commented Apr 15, 2026

@NeoZhangJianyu You're right. I tested this on clean upstream master (5d14e5d, no PR code) with compute-runtime 26.09.37435.1 and confirmed there's no host RAM shadowing with sycl::malloc_device:

Standalone allocation test: 4 GB sycl::malloc_device per device tile, monitored /proc/self/status VmRSS after each 256 MB chunk. 0% host RSS growth across all 4 tiles.
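
The VmRSS monitoring described in this test could be done with a helper along these lines. This is a sketch rather than the author's actual test program: the function names are hypothetical, and it assumes the Linux `/proc/<pid>/status` format in which the `VmRSS:` line reports a value in kB.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Extract the VmRSS value (in KiB) from text in /proc/<pid>/status format.
// Returns -1 if the field is missing or cannot be parsed.
inline long parse_vm_rss_kib(const std::string & status_text) {
    std::istringstream lines(status_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {            // line starts with "VmRSS:"
            std::istringstream fields(line.substr(6));
            long kib = -1;
            fields >> kib;                             // skips leading whitespace
            return fields ? kib : -1;
        }
    }
    return -1;
}

// Read the resident set size of the current process; Linux-only.
inline long current_vm_rss_kib() {
    std::ifstream status("/proc/self/status");
    std::stringstream buffer;
    buffer << status.rdbuf();
    return parse_vm_rss_kib(buffer.str());
}
```

Calling `current_vm_rss_kib()` before and after each allocation chunk and comparing the values gives the per-chunk host RSS growth.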

Real model load: Dual B70 via llama-bench -ngl 99, host RSS unchanged before and after load (~16.4 GB).

I'm glad the fix is in the runtime rather than needing a workaround on our side. A driver fix is cleaner than carrying a parallel allocation path, especially with SYCL graph and SVM on the horizon. Users on older runtimes (pre-26.05) should update libze-intel-gpu1 instead.

Closing this PR. Thanks to you and @arthw for the thorough review work; I learned a lot from the process.

@PMZFX PMZFX closed this Apr 15, 2026
@NeoZhangJianyu
Contributor

@PMZFX
It's great that you understand and are willing to close this PR.
The Level Zero API has better performance than SYCL,
but SYCL provides more functionality.
In the future the Level Zero API may be used for better performance, so please keep this PR and your branch as a reference.

Your PRs adding reorder optimization for Q8_0, Q4_K, and Q6_K are a very good performance improvement.
I hope more data types get reorder support!

Thank you very much!

@arthw
Contributor

arthw commented Apr 29, 2026

@PMZFX
Because this PR makes the build depend on the Level Zero SDK/dev library,
it will impact CI and the release (binary packages for Windows and Ubuntu).

The Windows package includes the oneAPI runtime, so users can run the binary without installing the oneAPI package locally.
We should make sure that is not broken.

The Ubuntu package only includes the llama.cpp binaries. It should work well together with the SYCL/Intel Docker image.

I will help verify them.

Please reopen this PR as a draft!
We need a little time to test and review it.

I think it's good to add the Level Zero API to the SYCL backend, though it is a little more complex.

Thank you!

@PMZFX
Contributor Author

PMZFX commented May 9, 2026

I triggered the requested fork-side Release and Publish Docker image workflows for the current PR head (8bead87e960bee9fa51d07b36fd01975122b79cd):

Most jobs completed successfully, but both workflows are currently blocked on the upstream ubuntu-24.04-s390x runner label:

  • Release: ubuntu-cpu (s390x, ubuntu-24.04-s390x)
  • Publish Docker image: Push Docker image to Docker Registry (... linux/s390x ...)

Those jobs have stayed queued with no runner assigned, so it looks like this fork may not have access to a runner matching ubuntu-24.04-s390x.

Could you please verify whether the rest of the jobs were successful and sufficient for review, since it looks like the fork can't run those specific jobs?

@CISC
Member

CISC commented May 9, 2026

> Could you please verify whether the rest of the jobs were successful and sufficient for review, since it looks like the fork can't run those specific jobs?

It's fine, I just wanted to see the builds succeed, you can cancel s390x.

@arthw arthw added the merge ready label May 12, 2026
@arthw arthw merged commit 9ed6e19 into ggml-org:master May 14, 2026
107 of 113 checks passed
SharkEzz added a commit to SharkEzz/llama.cpp that referenced this pull request May 14, 2026
commit dbe7901
Author: Ruben Ortlam <rortlam@redhat.com>
Date:   Thu May 14 10:36:54 2026 +0200

    vulkan: fix matmul integer pipeline selection (ggml-org#23005)

    * vulkan: fix matmul integer pipeline selection

    * gate pipeline creation with the right bools

commit 320a6a4
Author: Aleksander Grygier <aleksander.grygier@gmail.com>
Date:   Thu May 14 08:09:29 2026 +0200

    fix: Autoscroll detection (ggml-org#23026)

commit 9ed6e19
Author: Katostrofik <georgiopapairo@gmail.com>
Date:   Thu May 14 01:39:14 2026 -0400

    SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (ggml-org#21597)

    * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

    Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
    in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
    DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
    zeMemAllocDevice uses the SVM/P2P path with no host staging.

    On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
    consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
    With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
    no performance regression.

    All Level Zero calls include automatic fallback to the original SYCL
    allocation path if Level Zero interop is unavailable.

    * SYCL: address review feedback - remove try/catch, check device types, deduplicate

    - Remove try/catch from malloc/free/memcpy helpers, check backend and
      device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
    - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
      and declare in common.hpp to eliminate code duplication
    - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
    - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
      host-staged path for iGPU-to-dGPU transfers
    - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
      in CMakeLists.txt (co-authored with @arthw)

    * SYCL: add build/runtime flags for Level Zero, address review feedback

    Implements the architecture suggested by @arthw: compile-time and runtime
    flags to cleanly separate Level Zero and SYCL memory API paths.

    - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
      Zero code is wrapped in #ifdef so the build works on systems without
      the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
      loader library and headers are checked before enabling.

    - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
      whether Level Zero or SYCL memory APIs are used. Only one API style is
      used per session, no mixing. If Level Zero is enabled but the devices
      don't support the Level Zero backend, it auto-disables with a warning.

    - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
      is not called anywhere in the backend) and used try/catch for flow control.

    - Update SYCL.md with documentation for both new parameters.

    Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
    GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
    (Claude). Code reviewed and tested on my hardware.
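
A minimal sketch of how the two flags described above could combine into one session-wide decision. Only the environment variable name `GGML_SYCL_ENABLE_LEVEL_ZERO` and its default of 1 come from the commit message; the helper names `parse_enable_flag`, `read_enable_level_zero` and `resolve_use_level_zero`, and the rule that any value parsing to 0 disables the path, are assumptions for illustration and not the backend's actual code.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical parser: an unset or empty variable means "enabled" (the
// documented default of 1); a value that parses to 0 disables the path.
inline int parse_enable_flag(const char * raw) {
    if (raw == nullptr || *raw == '\0') {
        return 1;
    }
    return std::atoi(raw) != 0 ? 1 : 0;
}

// Reads the runtime variable named in the commit message.
inline int read_enable_level_zero() {
    return parse_enable_flag(std::getenv("GGML_SYCL_ENABLE_LEVEL_ZERO"));
}

// Hypothetical one-time decision, so that a session uses exactly one API style.
//   compiled_in        : build had GGML_SYCL_SUPPORT_LEVEL_ZERO enabled
//   env_flag           : result of read_enable_level_zero()
//   gpus_on_level_zero : every GPU in use is on the Level Zero backend
inline bool resolve_use_level_zero(bool compiled_in, int env_flag, bool gpus_on_level_zero) {
    if (!compiled_in || env_flag == 0) {
        return false;  // SYCL memory APIs for the whole session
    }
    if (!gpus_on_level_zero) {
        std::fprintf(stderr, "Level Zero requested but unsupported by the devices; disabling\n");
        return false;  // auto-disable with a warning
    }
    return true;
}
```

Resolving the choice once at initialization, rather than per allocation, is what keeps a pointer from being allocated by one API and freed by the other.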

    * SYCL: unify Level Zero malloc/free call sites, address review feedback

    Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
    Both functions are now unconditionally available: Level Zero code is
    #ifdef'd inside the functions, not at call sites. All call sites use
    uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

    Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
    traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
    sites (-29 lines net).

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

    Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
    so the Level Zero code path is compiled and tested in CI.

    Fix two bugs found during extended dual-GPU testing (no
    ONEAPI_DEVICE_SELECTOR set):

    - The Level Zero backend check was iterating all SYCL devices
      including CPU. The OpenCL CPU device caused Level Zero to be
      disabled for the GPUs, defeating the fix on multi-GPU systems.
      Added is_gpu() filter so only GPU devices are checked.

    - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
      were still calling sycl::malloc/sycl::free directly, bypassing the
      Level Zero path. Routed through ggml_sycl_malloc_device/free_device
      for consistency with the other device memory call sites.

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
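
The first bug above comes down to which devices take part in the backend check. A minimal sketch of the GPU-only filter, using a hypothetical `DeviceDesc` summary in place of real `sycl::device` queries (the type names and the function `all_gpus_on_level_zero` are assumptions, not identifiers from the backend):

```cpp
#include <cassert>
#include <vector>

enum class Backend { LevelZero, OpenCL, Other };

// Hypothetical summary of one enumerated SYCL device.
struct DeviceDesc {
    bool    is_gpu;
    Backend backend;
};

// Returns true only if at least one GPU exists and every GPU is on the
// Level Zero backend. Non-GPU devices (for example an OpenCL CPU device)
// are skipped, so they cannot disable Level Zero for the GPUs.
inline bool all_gpus_on_level_zero(const std::vector<DeviceDesc> & devices) {
    bool saw_gpu = false;
    for (const DeviceDesc & d : devices) {
        if (!d.is_gpu) {
            continue;  // the filter that the unfixed check was missing
        }
        saw_gpu = true;
        if (d.backend != Backend::LevelZero) {
            return false;
        }
    }
    return saw_gpu;
}
```

Without the `is_gpu` test, a dual-GPU system that also exposes an OpenCL CPU device would fail the check, which matches the failure mode described in the commit message.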

    * SYCL: address arthw review feedback on Level Zero memory API structure

    - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
      only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
    - Switch both helpers to use g_ggml_sycl_enable_level_zero global
      instead of per-call queue backend checks
    - Remove #ifdef wrapper from global definition; always declare at 0,
      add #else branch in init block so it stays 0 when L0 not compiled in
    - Update init loop comment to explain GPU-only device check
    - CMakeLists: message(STATUS) before the if block; align option wording

    AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
    B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
    Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
    <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    * SYCL: remove unused cstdio/cstdlib includes from common.cpp

    Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

    * Apply suggestions from code review

    Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

    * SYCL: preserve Level Zero allocation path during early malloc

    * ci: fix Level Zero package conflict in Intel Docker build

    * ci: find Level Zero loader in oneAPI package step

    * ci: allow Windows SYCL package without Level Zero DLL

    ---------

    Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 14, 2026
…ions (ggml-org#21597)

@Stoney49th

Stoney49th commented May 14, 2026

Hey, with this merged, it gets stuck during model warmup. Here's the L0 debug log: the last calls are an append memory copy and a host sync, and then the GPU is stuck at 100%. Setup: dual B50 in a Docker container; it worked previously with both cards after we reverted the versions.

Log

[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.280] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetGroupSize(hKernel, groupSizeX, groupSizeY, groupSizeZ)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeKernelSetArgumentValue(hKernel, argIndex, argSize, pArgValue)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendLaunchKernel(hCommandList, hKernel, pLaunchFuncArgs, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeMemAllocDevice(hContext, device_desc, size, alignment, hDevice, pptr)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeCommandListAppendMemoryCopy(hCommandList, dstptr, srcptr, size, hSignalEvent, numWaitEvents, phWaitEventsLocal)
[35077] [2026-05-14 11:55:59.281] [ze_loader] [trace] zeEventHostSynchronize(hEvent, timeout)

Since you mentioned P2P, does this require kernel 7.0 and later? Or a newer intel compute runtime version? We had to revert, maybe some stuff is now not matching anymore? This is with #22968 also applied.

@sanmai sanmai mentioned this pull request May 14, 2026
3 tasks
@PMZFX
Contributor Author

PMZFX commented May 14, 2026

I think this may be a Level Zero peer-copy compatibility issue rather than a problem with the L0 allocation path itself.

#21597 was validated on newer runtimes, e.g. libze-intel-gpu1 26.09.37435.1 in the PR notes. After that, #22968 changed the Intel Docker compute-runtime from 26.14.37833.4 to 25.40.35563.10. The failure trace appears to stop around zeCommandListAppendMemoryCopy, which is the new direct dGPU-to-dGPU path in dev2dev_memcpy().

Current code gates that path on both devices being discrete L0 GPUs, but it does not call zeDeviceCanAccessPeer() before issuing the cross-device copy. On older runtimes/kernel stacks, direct P2P may be unavailable or broken.

Suggested next test:

  1. Run with GGML_SYCL_ENABLE_LEVEL_ZERO=0 to confirm the old SYCL path works.
  2. Test a patch that keeps zeMemAllocDevice() enabled but forces dev2dev_memcpy() to use the host-staged fallback.
  3. If that works, add a zeDeviceCanAccessPeer(dst_dev, src_dev, &can_access) guard before zeCommandListAppendMemoryCopy, falling back to host-staged copy when peer access is unavailable.

@arthw does that sound like the right compatibility boundary? I do not think we should remove the L0 allocation path, because that is what fixes the RAM exhaustion; the risky part seems to be the unconditional L0 peer copy.
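
As a rough illustration of step 3, the choice between the direct Level Zero copy and the host-staged fallback might look like the sketch below. It is not the code in this PR: `copy_path`, `peer_query`, and `choose_copy_path` are hypothetical names, and the peer query is injected as a callable so the decision logic can be exercised without a GPU. In the backend, that callable would presumably wrap `zeDeviceCanAccessPeer(hDevice, hPeerDevice, &value)` and treat a failed call as "no access".

```cpp
#include <cassert>
#include <functional>

// Which path a device-to-device copy would take (illustrative names only).
enum class copy_path { l0_direct, host_staged };

// Hypothetical stand-in for the real peer-access query. A real version would
// call zeDeviceCanAccessPeer() and return false on error or when the runtime
// reports that the peer is not accessible.
using peer_query = std::function<bool(int dst_dev, int src_dev)>;

// Pick the copy path. The direct Level Zero copy is used only when Level Zero
// is enabled, both devices are discrete GPUs, and the peer query succeeds.
// Every other case keeps the host-staged fallback.
inline copy_path choose_copy_path(bool l0_enabled, bool dst_is_dgpu, bool src_is_dgpu,
                                  int dst_dev, int src_dev,
                                  const peer_query & can_access_peer) {
    // This sketch only covers cross-device copies.
    assert(dst_dev != src_dev);

    if (!l0_enabled || !dst_is_dgpu || !src_is_dgpu) {
        return copy_path::host_staged;
    }
    if (!can_access_peer || !can_access_peer(dst_dev, src_dev)) {
        return copy_path::host_staged;
    }
    return copy_path::l0_direct;
}
```

Passing a query that always returns `false` reproduces the "force host-staged copy" experiment from step 2 without touching the allocation path.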

@Stoney49th

> I think this may be a Level Zero peer-copy compatibility issue rather than a problem with the L0 allocation path itself.
>
> #21597 was validated on newer runtimes, e.g. libze-intel-gpu1 26.09.37435.1 in the PR notes. After that, #22968 changed the Intel Docker compute-runtime from 26.14.37833.4 to 25.40.35563.10. The failure trace appears to stop around zeCommandListAppendMemoryCopy, which is the new direct dGPU-to-dGPU path in dev2dev_memcpy().
>
> Current code gates that path on both devices being discrete L0 GPUs, but it does not call zeDeviceCanAccessPeer() before issuing the cross-device copy. On older runtimes/kernel stacks, direct P2P may be unavailable or broken.
>
> Suggested next test:
>
> 1. Run with `GGML_SYCL_ENABLE_LEVEL_ZERO=0` to confirm the old SYCL path works.
>
> 2. Test a patch that keeps `zeMemAllocDevice()` enabled but forces `dev2dev_memcpy()` to use the host-staged fallback.
>
> 3. If that works, add a `zeDeviceCanAccessPeer(dst_dev, src_dev, &can_access)` guard before `zeCommandListAppendMemoryCopy`, falling back to host-staged copy when peer access is unavailable.
>
> @arthw does that sound like the right compatibility boundary? I do not think we should remove the L0 allocation path, because that is what fixes the RAM exhaustion; the risky part seems to be the unconditional L0 peer copy.

Thanks for the quick reply, much appreciated!

I've just merged in both PRs; maybe your addition of the L0 memory backend resolves the issue we faced with a regression in the compute runtime, but I need to cherry-pick and test in more detail. Do you know if specific newer kernel versions or features are required for P2P in general? Not to hunt ghosts, just in case my kernel on the host is too old... I'm currently running 7.0.3-1 on Manjaro as the Docker host.

I'll start a new debug run later and report back with feedback on your suggested tests.

Thanks for the contributions!

@Stoney49th

Stoney49th commented May 14, 2026

@PMZFX just to confirm before I start on that: are you also running a Docker-based setup, plain, without any relevant modifications to intel.Dockerfile? That is, did the versions

ARG IGC_VERSION=v2.32.7
ARG IGC_VERSION_FULL=2_2.32.7+21184
ARG COMPUTE_RUNTIME_VERSION=26.14.37833.4
ARG COMPUTE_RUNTIME_VERSION_FULL=26.14.37833.4-0
ARG IGDGMM_VERSION=22.9.0

and

ARG LEVEL_ZERO_VERSION=1.28.2
ARG LEVEL_ZERO_UBUNTU_VERSION=u24.04

test OK on your end within a container with 2x B70s?

Edit: Unable to test, running into the same issue as before which triggered #22968.

@sniperwhg

sniperwhg commented May 15, 2026

Running on 1xB70 and 1xB580, RAM utilization is significantly lower (by about 40G, in line with the VRAM allocation).
However, gibberish is now generated when running gemma-4-31B-it-UD-Q6_K_XL.gguf
This was done including the commit with the downgrade of the Intel compute-runtime 0f45f1a.
Rolling back to commit 4c1c3ac resolves this.

Edit:
Moving forward to 42532af seems to work fine from some short testing while maintaining the system memory savings. @PMZFX I think your theory about the compute-runtime rollback is correct.

@arthw
Contributor

arthw commented May 15, 2026

The Level Zero API provides better performance in some cases.
This PR is valuable because it adds Level Zero API support at the framework level.

There are two main changes in using the Level Zero API to replace the SYCL API:

  1. malloc/free of device memory.

  2. dev2dev memcpy.

We will test more on different Ubuntu versions and stable compute runtime versions.

If it causes trouble for you, please disable this feature at runtime with: export GGML_SYCL_ENABLE_LEVEL_ZERO=0.
If we can't fix the issues in a short time, I suggest disabling this feature by default.

Thank you!
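
For readers unfamiliar with how such a runtime switch is usually read, here is a minimal sketch of the `GGML_SYCL_ENABLE_LEVEL_ZERO` toggle, assuming the default of 1 stated in this PR's commit notes. The helper names `env_flag` and `level_zero_requested` are hypothetical and are not taken from the backend; the actual parsing in ggml-sycl may differ.

```cpp
#include <cassert>
#include <cstdlib>

// Read an integer flag from the environment. Returns `default_value` when the
// variable is unset or empty. Non-numeric text parses as 0 via std::atoi.
// (Hypothetical helper for illustration.)
inline int env_flag(const char * name, int default_value) {
    assert(name != nullptr);
    const char * raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') {
        return default_value;
    }
    return std::atoi(raw);
}

// True when the user has not opted out of the Level Zero memory path.
// Assumes the default of 1 described in the commit notes.
inline bool level_zero_requested() {
    return env_flag("GGML_SYCL_ENABLE_LEVEL_ZERO", 1) != 0;
}
```

With this shape, `export GGML_SYCL_ENABLE_LEVEL_ZERO=0` makes `level_zero_requested()` return false, and leaving the variable unset keeps the Level Zero path requested.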

@Stoney49th

I tested with the API set to the older fallback version from #22968 and with

environment:
      - GGML_SYCL_ENABLE_LEVEL_ZERO=0

it works, with the new part disabled as desired. I'm not sure if P2P is the culprit here, since it should be functional in my setup. Is there an easy, straightforward way to test whether P2P is actually working, since it seems to be a prerequisite? Any program, script, or something else I could try? I'm not deep enough into the Intel GPU stack to know what's needed to test this...

dandm1 pushed a commit to dandm1/llama.cpp that referenced this pull request May 16, 2026
…ions (ggml-org#21597)

* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

* SYCL: address review feedback - remove try/catch, check device types, deduplicate

- Remove try/catch from malloc/free/memcpy helpers, check backend and
  device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
- Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
  and declare in common.hpp to eliminate code duplication
- Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
- Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
  host-staged path for iGPU-to-dGPU transfers
- Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
  in CMakeLists.txt (co-authored with @arthw)

* SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

- Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
  Zero code is wrapped in #ifdef so the build works on systems without
  the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
  loader library and headers are checked before enabling.

- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
  whether Level Zero or SYCL memory APIs are used. Only one API style is
  used per session, no mixing. If Level Zero is enabled but the devices
  don't support the Level Zero backend, it auto-disables with a warning.

- Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
  is not called anywhere in the backend) and used try/catch for flow control.

- Update SYCL.md with documentation for both new parameters.

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.

* SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

- The Level Zero backend check was iterating all SYCL devices
  including CPU. The OpenCL CPU device caused Level Zero to be
  disabled for the GPUs, defeating the fix on multi-GPU systems.
  Added is_gpu() filter so only GPU devices are checked.

- sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
  were still calling sycl::malloc/sycl::free directly, bypassing the
  Level Zero path. Routed through ggml_sycl_malloc_device/free_device
  for consistency with the other device memory call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: address arthw review feedback on Level Zero memory API structure

- Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
  only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
- Switch both helpers to use g_ggml_sycl_enable_level_zero global
  instead of per-call queue backend checks
- Remove #ifdef wrapper from global definition; always declare at 0,
  add #else branch in init block so it stays 0 when L0 not compiled in
- Update init loop comment to explain GPU-only device check
- CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* SYCL: preserve Level Zero allocation path during early malloc

* ci: fix Level Zero package conflict in Intel Docker build

* ci: find Level Zero loader in oneAPI package step

* ci: allow Windows SYCL package without Level Zero DLL

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
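
The device-check fix described in the commit notes above (only GPU devices decide whether Level Zero stays enabled, so an OpenCL CPU device no longer disables it) can be sketched roughly as follows. This models devices with a plain struct rather than `sycl::device`, so `device_info`, `backend_kind`, and `all_gpus_use_level_zero` are hypothetical names. Returning false when no GPU is present is an assumption of this sketch, not something the PR states.

```cpp
#include <cassert>
#include <vector>

enum class backend_kind { level_zero, opencl };

// Hypothetical stand-in for the properties queried from a sycl::device.
struct device_info {
    bool         is_gpu;
    backend_kind backend;
};

// Level Zero stays enabled only if every GPU device uses the Level Zero
// backend. Non-GPU devices (for example an OpenCL CPU device) are skipped,
// so they cannot disable the Level Zero path for the GPUs.
inline bool all_gpus_use_level_zero(const std::vector<device_info> & devices) {
    bool saw_gpu = false;
    for (const auto & dev : devices) {
        if (!dev.is_gpu) {
            continue;  // the GPU-only filter
        }
        saw_gpu = true;
        if (dev.backend != backend_kind::level_zero) {
            return false;
        }
    }
    return saw_gpu;  // assumption: nothing to enable when no GPU is present
}
```

Without the `is_gpu` filter, the first test case below (two Level Zero GPUs plus an OpenCL CPU device) would report false, which matches the bug described in the commit.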
@NeoZhangJianyu
Contributor

> I tested with the API set to the older fallback version from #22968 and with
>
> environment:
>       - GGML_SYCL_ENABLE_LEVEL_ZERO=0
>
> it works, with the new part disabled as desired. I'm not sure if P2P is the culprit here, since it should be functional in my setup. Is there an easy, straightforward way to test whether P2P is actually working, since it seems to be a prerequisite? Any program, script, or something else I could try? I'm not deep enough into the Intel GPU stack to know what's needed to test this...

There is no easy way to roll back the P2P code.
You could modify the code to use the legacy path for P2P.

Thank you!

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
…ions (ggml-org#21597)

ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request May 19, 2026
…ions (ggml-org#21597)

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
…ions (ggml-org#21597)

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
…ions (ggml-org#21597)

* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

* SYCL: address review feedback - remove try/catch, check device types, deduplicate

- Remove try/catch from malloc/free/memcpy helpers, check backend and
  device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
- Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
  and declare in common.hpp to eliminate code duplication
- Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
- Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
  host-staged path for iGPU-to-dGPU transfers
- Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
  in CMakeLists.txt (co-authored with @arthw)

* SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

- Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
  Zero code is wrapped in #ifdef so the build works on systems without
  the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
  loader library and headers are checked before enabling.

- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
  whether Level Zero or SYCL memory APIs are used. Only one API style is
  used per session, no mixing. If Level Zero is enabled but the devices
  don't support the Level Zero backend, it auto-disables with a warning.

- Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
  is not called anywhere in the backend) and used try/catch for flow control.

- Update SYCL.md with documentation for both new parameters.

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.
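
A rough sketch of how the runtime flag could be resolved once at startup follows. It assumes the variable is read as an integer with a default of 1, as the commit text states. The helper names `parse_int_flag` and `resolve_level_zero` are hypothetical, and the backend's actual parsing and warning text may differ.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: read an integer flag, using default_value when the
// variable is unset or empty.
inline int parse_int_flag(const char * value, int default_value) {
    if (value == nullptr || *value == '\0') {
        return default_value;
    }
    return std::atoi(value);  // non-numeric text parses as 0
}

// Decide once per session which memory API style to use. Level Zero is used
// only when it is requested and the GPUs actually run on the Level Zero backend.
inline bool resolve_level_zero(const char * env_value, bool gpus_support_level_zero) {
    const bool requested = parse_int_flag(env_value, 1) != 0;
    if (requested && !gpus_support_level_zero) {
        std::fprintf(stderr, "warning: Level Zero requested but unavailable, using SYCL memory APIs\n");
        return false;
    }
    return requested;
}
```

A caller would pass `std::getenv("GGML_SYCL_ENABLE_LEVEL_ZERO")` as `env_value` and store the result in a single global, so that allocation and free always agree on the API in use.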

* SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

- The Level Zero backend check was iterating all SYCL devices
  including CPU. The OpenCL CPU device caused Level Zero to be
  disabled for the GPUs, defeating the fix on multi-GPU systems.
  Added is_gpu() filter so only GPU devices are checked.

- sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
  were still calling sycl::malloc/sycl::free directly, bypassing the
  Level Zero path. Routed through ggml_sycl_malloc_device/free_device
  for consistency with the other device memory call sites.
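
The first bug above comes down to which devices take part in the backend check. The sketch below models it with a mock `device_info` struct instead of `sycl::device`, so every name in it is hypothetical. Requiring at least one GPU before enabling Level Zero is also an assumption made for the example.

```cpp
#include <vector>

enum class backend_kind { level_zero, opencl };

// Hypothetical stand-in for the two sycl::device properties the check needs.
struct device_info {
    bool         is_gpu;
    backend_kind backend;
};

// Returns true only if at least one GPU is present and every GPU uses the
// Level Zero backend. Non-GPU devices (e.g. an OpenCL CPU device) are skipped,
// so they can no longer disable Level Zero for the GPUs.
inline bool all_gpus_use_level_zero(const std::vector<device_info> & devices) {
    bool seen_gpu = false;
    for (const device_info & d : devices) {
        if (!d.is_gpu) {
            continue;  // the fix: ignore CPU and other non-GPU devices
        }
        seen_gpu = true;
        if (d.backend != backend_kind::level_zero) {
            return false;
        }
    }
    return seen_gpu;
}
```

With two Level Zero GPUs plus an OpenCL CPU device in the list, this returns true. Without the `is_gpu` filter the CPU entry would have made the check fail.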

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: address arthw review feedback on Level Zero memory API structure

- Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
  only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
- Switch both helpers to use g_ggml_sycl_enable_level_zero global
  instead of per-call queue backend checks
- Remove #ifdef wrapper from global definition; always declare at 0,
  add #else branch in init block so it stays 0 when L0 not compiled in
- Update init loop comment to explain GPU-only device check
- CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* SYCL: preserve Level Zero allocation path during early malloc

* ci: fix Level Zero package conflict in Intel Docker build

* ci: find Level Zero loader in oneAPI package step

* ci: allow Windows SYCL package without Level Zero DLL

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
…ions (ggml-org#21597)


Labels

- devops: improvements to build systems and github actions
- documentation: Improvements or additions to documentation
- ggml: changes relating to the ggml tensor library for machine learning
- merge ready: A maintainer can use this label to indicate that they consider the changes final and ready to merge.
- SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language

8 participants